attention-only transformer

Attention-Only Transformers via Unrolled Subspace Denoising

Wang, Peng; Lu, Yifu; Yu, Yaodong; Pai, Druv; Qu, Qing; Ma, Yi

arXiv.org Artificial Intelligence

Despite the popularity of transformers in practice, their architectures are empirically designed and neither mathematically justified nor interpretable. Moreover, as indicated by many empirical studies, some components of transformer architectures may be redundant. To derive a fully interpretable transformer architecture with only the necessary components, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. To compress these noisy token representations, the associated denoising operation naturally takes the form of multi-head (subspace) self-attention. By unrolling such iterative denoising operations into a deep network, we arrive at a highly compact architecture that consists of only self-attention operators with skip connections at each layer. Moreover, we show that each layer performs highly efficient denoising: it improves the signal-to-noise ratio of token representations at a linear rate with respect to the number of layers. Despite its simplicity, extensive experiments on vision and language tasks demonstrate that such a transformer achieves performance close to that of standard transformer architectures such as GPT-2 and CRATE.
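The layer described in this abstract can be sketched in a few lines. Below is a minimal numpy illustration of one attention-only layer in this spirit: each head projects the tokens onto a learned low-dimensional subspace, runs self-attention among the projected tokens, and adds the result back through a skip connection. The shared per-head basis U_k, the step size eta, and the toy two-subspace data are assumptions made for illustration, not the paper's exact operator.

```python
import numpy as np

def softmax(A, axis=-1):
    A = A - A.max(axis=axis, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=axis, keepdims=True)

def subspace_attention_layer(X, U_list, eta=0.5):
    """One attention-only layer with a skip connection (sketch only).

    Each head projects tokens onto a learned subspace U_k, attends among the
    projected tokens, and lifts the denoised projection back to the ambient
    space; eta and the shared-basis parameterization are assumptions.
    X: (n_tokens, d) token representations; each U_k: (d, p) orthonormal basis.
    """
    update = np.zeros_like(X)
    for U in U_list:
        Z = X @ U                                    # project tokens onto the subspace
        A = softmax(Z @ Z.T / np.sqrt(U.shape[1]))   # attention among projected tokens
        update += (A @ Z) @ U.T                      # lift the denoised projection back
    return X + eta * update                          # skip connection

# Toy usage: tokens near two 2-D subspaces of R^8, denoised over a few layers.
rng = np.random.default_rng(0)
d, p, n = 8, 2, 16
U_list = [np.linalg.qr(rng.normal(size=(d, p)))[0] for _ in range(2)]
X = np.vstack([rng.normal(size=(n, p)) @ U.T for U in U_list])
X += 0.1 * rng.normal(size=X.shape)                  # noise off the subspaces
for _ in range(4):
    X = subspace_attention_layer(X, U_list)
```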


Transformers on Markov Data: Constant Depth Suffices

Rajaraman, Nived; Bondaschi, Marco; Ramchandran, Kannan; Gastpar, Michael; Makkuva, Ashok Vardhan

arXiv.org Artificial Intelligence

Attention-based transformers have been remarkably successful at modeling generative processes across various domains and modalities. In this paper, we study the behavior of transformers on data drawn from $k$-th order Markov processes, where the conditional distribution of the next symbol in a sequence depends on the previous $k$ symbols observed. We empirically observe a surprising phenomenon that contradicts previous findings: when trained for sufficiently long, a transformer with fixed depth and $1$ head per layer is able to achieve low test loss on sequences drawn from $k$-th order Markov sources, even as $k$ grows. Furthermore, this low test loss is achieved by the transformer's ability to represent and learn the in-context conditional empirical distribution. On the theoretical side, our main result is that a transformer with a single head and three layers can represent the in-context conditional empirical distribution for $k$-th order Markov sources, concurring with our empirical observations. Along the way, we prove that attention-only transformers with $O(\log_2(k))$ layers can represent the in-context conditional empirical distribution by composing induction heads to track the previous $k$ symbols in the sequence. These results deepen our understanding of the mechanisms by which transformers learn to capture context, by characterizing their behavior on Markov sources.
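The statistic at the center of these results, the in-context conditional empirical distribution, is easy to state concretely. The sketch below estimates the next-symbol distribution for a $k$-th order source by counting earlier occurrences of the current length-$k$ context in the sequence; the uniform fallback for unseen contexts is an assumption added for illustration, not part of the paper.

```python
from collections import Counter

def in_context_conditional_empirical(seq, k, vocab):
    """Empirical distribution of the next symbol given the last k symbols,
    estimated from earlier occurrences of that same length-k context in seq.
    Minimal sketch of the statistic the paper shows transformers can represent."""
    context = tuple(seq[-k:])
    counts = Counter()
    for t in range(k, len(seq)):            # scan earlier positions
        if tuple(seq[t - k:t]) == context:  # context seen before ...
            counts[seq[t]] += 1             # ... count the symbol that followed it
    total = sum(counts.values())
    if total == 0:                          # unseen context: fall back to uniform (assumption)
        return {s: 1.0 / len(vocab) for s in vocab}
    return {s: counts[s] / total for s in vocab}

# Example: binary sequence, second-order context (k = 2)
seq = [0, 1, 1, 0, 1, 1, 0, 1]
print(in_context_conditional_empirical(seq, k=2, vocab=[0, 1]))
```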


Attention-Only Transformers and Implementing MLPs with Attention Heads

Huben, Robert; Morris, Valerie

arXiv.org Artificial Intelligence

The transformer architecture was introduced in the landmark 2017 paper Attention is All You Need (Vaswani et al., 2023) and traditionally consists of alternating attention and multilayer-perceptron (MLP) sublayers. Although initially used for machine translation, transformers have since been applied across a wide range of tasks, including language modeling (Radford et al., 2018; Devlin et al., 2019; Liu et al., 2018), computer vision (Khan et al., 2022; Cornia et al., 2020), and image generation (Parmar et al., 2018). The widespread deployment of transformers has led to increasing interest in mechanistic interpretability (Wang et al., 2022; Conmy et al., 2023), which seeks to convert the computations of transformers into human-understandable explanations. Some interpretability efforts, such as Elhage et al. (2021), focused on attention-only transformers, finding that MLP layers were harder to interpret. This work seeks to supplement those mechanistic interpretability methods by showing that MLP layers in transformers are equivalent to a sum of masked attention heads and can therefore be subjected to interpretability techniques that work on attention-only transformers. In Theorem 3 we show that, by including a "bias token" akin to the persistent memory vectors in Sukhbaatar et al. (2019) and using a slightly unusual attention-masking pattern, an MLP layer of size $l$ can be written as the sum of $l$ attention heads with internal dimension 1. We show in Theorem 6 that one can apply this process throughout the entire transformer, converting the typical MLP-and-attention transformer into an attention-only transformer. We then show in Theorems 7 and 8 that attention heads can separately implement row-wise linear transformations and matrix-level activation functions. Finally, we show in Theorem 9 that a slightly augmented network can approximate any masking pattern to within arbitrary error.
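The flavor of the bias-token construction can be checked numerically. The sketch below builds one masked attention head with internal dimension 1 that, when each position may attend only to itself and a bias token, reproduces a single sigmoid "neuron" sigma(w.x + c) * u; summing $l$ such heads then gives a one-hidden-layer MLP of width $l$. The augmented embedding (a constant coordinate plus a bias-token flag) and the sigmoid activation are assumptions of this toy version, not the paper's exact Theorem 3 construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, n = 4, 5                       # feature dimension, number of regular tokens
w, c = rng.normal(size=d), 0.3    # hypothetical neuron input weights and bias
u = rng.normal(size=d)            # hypothetical neuron output direction
X = rng.normal(size=(n, d))       # regular token features

# Augmented embeddings: [features | constant 1 | bias-token flag] (assumption).
E = np.hstack([X, np.ones((n, 1)), np.zeros((n, 1))])
bias_tok = np.concatenate([np.zeros(d), [1.0, 1.0]])

# One head with internal (key/value) dimension 1.
W_Q = np.concatenate([np.zeros(d), [1.0, 0.0]])   # query = constant 1
W_K = np.concatenate([-w, [0.0, c]])              # key(x) = -w.x, key(bias) = c
W_V = np.concatenate([np.zeros(d), [0.0, 1.0]])   # value(x) = 0, value(bias) = 1
W_O = u                                           # scalar value -> d-dim output

head_out = np.zeros((n, d))
for i in range(n):
    q = E[i] @ W_Q                                        # scalar query
    logits = q * np.array([E[i] @ W_K, bias_tok @ W_K])   # mask: attend to {self, bias}
    p = np.exp(logits) / np.exp(logits).sum()             # softmax over the allowed pair
    v = np.array([E[i] @ W_V, bias_tok @ W_V])            # scalar values
    head_out[i] = (p @ v) * W_O

mlp_neuron = sigmoid(X @ w + c)[:, None] * u      # one sigmoid "neuron" of an MLP
print(np.allclose(head_out, mlp_neuron))          # True: head == single neuron
```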